DOCUMENT RESUME

ED 135 841                    95                    TM 006 068

AUTHOR          Kosecoff, Jacqueline; Fink, Arlene
TITLE           The Appropriateness of Criterion-Referenced Tests for
                Evaluation Studies.
INSTITUTION     ERIC Clearinghouse on Tests, Measurement, and
                Evaluation, Princeton, N.J.
SPONS AGENCY    National Inst. of Education (DHEW), Washington, D.C.
REPORT NO       ERIC-TM-60
PUB DATE        Dec 76
CONTRACT        400-75-0015
EDRS PRICE      MF-$0.83 HC-$2.06 Plus Postage.
DESCRIPTORS     Criteria; *Criterion Referenced Tests; Feasibility
                Studies; Program Effectiveness; *Program Evaluation;
                Test Construction; Testing Problems; *Test Reviews;
                Test Selection; Test Validity
IDENTIFIERS     *Large Scale Evaluation



ABSTRACT

This report represents the results of an investigation of the
appropriateness of criterion-referenced tests (CRTs) for large-scale
evaluations. First, the development and validation of CRTs, including
the formulation and generation of CRT objectives, items, and
score-interpretation schemes and dimensions of item and test quality,
were examined to determine whether on theoretical grounds alone CRTs
are suitable for large-scale evaluations. Second, the practical
characteristics of CRTs were studied to determine if it is feasible to
use currently available CRTs for large-scale evaluations. A set of
criteria for selecting tests for evaluation purposes was devised and
used to review 28 CRTs. A conclusion was reached that CRTs are not
appropriate for use in large-scale evaluations for practical but not
theoretical reasons. (Author)



Documents acquired by ERIC include many informal unpublished
materials not available from other sources. ERIC makes every effort
to obtain the best copy available. Nevertheless, items of marginal
reproducibility are often encountered and this affects the quality
of the microfiche and hardcopy reproductions ERIC makes available
via the ERIC Document Reproduction Service (EDRS). EDRS is not
responsible for the quality of the original document. Reproductions
supplied by EDRS are the best that can be made from the original.



ERIC CLEARINGHOUSE ON TESTS, MEASUREMENT, & EVALUATION
EDUCATIONAL TESTING SERVICE, PRINCETON, NEW JERSEY 08540



TM REPORT 60 






THE APPROPRIATENESS OF CRITERION-REFERENCED
TESTS FOR EVALUATION STUDIES*



Jacqueline Kosecoff and Arlene Fink



DECEMBER 1976 



THIS DOCUMENT HAS BEEN REPRODUCED EXACTLY AS RECEIVED FROM
THE PERSON OR ORGANIZATION ORIGINATING IT. POINTS OF VIEW OR OPINIONS
STATED DO NOT NECESSARILY REPRESENT OFFICIAL NATIONAL INSTITUTE OF
EDUCATION POSITION OR POLICY.



ABSTRACT

This report represents the results of an investigation of the
appropriateness of criterion-referenced tests (CRTs) for large-scale
evaluations. First, the development and validation of CRTs, including
the formulation and generation of CRT objectives, items, and
score-interpretation schemes and dimensions of item and test quality,
were examined to determine whether on theoretical grounds alone CRTs
are suitable for large-scale evaluations. Second, the practical
characteristics of CRTs were studied to determine if it is feasible
to use currently available CRTs for large-scale evaluations.

A set of criteria for selecting tests for evaluation purposes was
devised and used to review 28 CRTs. A conclusion was reached that
CRTs are not appropriate for use in large-scale evaluations for
practical but not theoretical reasons.



INTRODUCTION



Criterion-referenced tests are becoming increasingly popular among
educators and psychometricians. Perhaps the most important reason for
their appearance and widespread acceptance can be traced to the new
ways that had to be found to measure the effects of the educational
reforms of the 1950s and 1960s. During those decades, the conventional
school curriculum was declared in need of reform, and a reassessment
of the goals and objectives of American education was made (19, 7, 6).
Innovative courses of study and instructional technologies were
subsequently developed, and programmed learning and individualized
instruction became common teaching approaches. New ways of assessing
student performance were needed that corresponded to these teaching
innovations.

Educators have traditionally relied on paper-and-pencil achievement
tests to measure learning, so it was natural for them to turn to test
theoreticians to provide them with alternative ways of interpreting
performance on measures of educational achievement for the new
curriculums and methods of instruction. The psychometricians responded
by pointing to two basic ways of assigning meaning to test scores. The
first involved comparing the performance or behavior of one person or
group with another person or group, and the second involved describing
what a person or group can do or can be expected to do. Glaser (10)
referred to these two ways of giving meaning to test scores as
"norm-referenced" and "criterion-referenced," and recommended
criterion-referenced score interpretations for the reformed
curriculums and instruction.

The reaction to criterion-referenced tests (CRTs) was enthusiastic
from the start. Because they provide score interpretations in terms of
the achievement of specific and measurable skills and behaviors, CRTs
have appealed to those directly responsible for the education of
students and the development and evaluation of educational programs.
They have also appealed to teachers who found the results of
standardized tests inadequate to assist them in planning lessons and
to many educators and psychologists who judged standardized,
norm-referenced tests to be unfair and even biased against individuals
from underprivileged and minority groups.

The interest in CRTs demonstrated by both theoreticians and
practitioners has led to their frequent use for instructional
diagnosis and placement and for measuring student achievement of
educational tasks or objectives and professional or occupational
licensure or credentials. In addition, CRTs are being suggested or
used for other purposes, such as the evaluation of educational
programs and the National Assessment of Educational Progress (32).

*The authors wish to thank System Development Corporation for
providing support for this study.



The material in this publication was prepared pursuant to a contract with the National Institute of Education, U.S. Department of Health, Education and
Welfare. Contractors undertaking such projects under government sponsorship are encouraged to express freely their judgment in professional and
technical matters. Prior to publication, the manuscript was submitted to qualified professionals for critical review and determination of professional
competence. This publication has met such standards. Points of view or opinions, however, do not necessarily represent the official view or opinions of
either these reviewers or the National Institute of Education.



THE PURPOSES OF PROGRAM EVALUATION



... program's merit and provides information about the nature and
quality of the program's goals, outcomes, impact, and costs (9).

There are two contexts in which evaluations of educational programs
are conducted. In one, an evaluation is conducted to improve a
program, and the evaluation's clients are typically the program's
organizers and staff. In the second, an evaluation is conducted to
measure the effectiveness of a program, and the evaluation's clients
are typically the program's sponsors. The context for an evaluation is
determined by the information needs of the individuals and agencies
who must use the evaluation information.

An evaluation is performed in an improvement context when the
evaluation's clients are concerned with finding out precisely where
and if a change would make the program better. Typically, the
organizers of a still-developing program require this kind of
information so that they can modify and improve the program. On the
other hand, an evaluation is conducted in an effectiveness context
when the evaluation's clients are particularly concerned with
determining the consistency and efficiency with which the program
achieves desired results. Those individuals who sponsored program
development or who are interested in using the program require this
kind of information about a well-established program's outcomes and
impact.

In an effectiveness evaluation, the evaluator usually assumes a more
global and independent stance toward the program than in an
improvement context. In addition, the evaluator usually makes use of
powerful, experimental design strategies that permit comparisons,
relies on empirically validated and standardized instruments, and
employs statistical and other analytic methods that allow inferences
regarding the program's comparative value.

Evaluations of educational programs can be conducted for a single
classroom, a grade level, a school, a district, a state, and/or for
the entire nation. Large-scale evaluations encompass great numbers of
students and frequently include many schools, several grades, and
different districts or states.



A Study of CRTs and Large-Scale Evaluations

This report presents the results of an investigation of the
appropriateness of using criterion-referenced tests for large-scale
evaluations conducted in an effectiveness-evaluation context. The
investigation began with an examination of the theory that underlies
the development and validation of CRTs to determine whether, on
theoretical grounds alone, CRTs are suitable or not suitable for
large-scale effectiveness evaluations. The next step was to develop a
set of criteria for selecting tests that are appropriate for such
evaluations. Included within the set of criteria was the stipulation
that the test be able to provide scores amenable to CRT
interpretation. Available CRTs were then reviewed, using the
identified set of criteria. Finally, conclusions were drawn based on
the theoretical examination and the review.



A THEORETICAL EXAMINATION OF CRITERION-REFERENCED TESTS



A criterion-referenced test, according to Glaser and his colleagues
(10), is one that is deliberately constructed to give scores that tell
what kinds of behavior individuals with those scores can demonstrate.
All CRTs should share several features in common:

1. They should be based on clearly defined educational tasks and
   purposes.

2. Test items should be specifically designed to measure the purposes
   and tasks.

3. Scores should be interpreted in terms of attainment of a preset
   criterion or level of competence with respect to the purposes and
   tasks.

Other definitions of CRTs have also been offered (10, 11, 13, 28).
While these definitions differ considerably in terms of the
limitations and constraints placed on a criterion-referenced test,
they all involve reporting test scores in terms of achievement of
educational tasks.



How Criterion-Referenced Tests Are Developed

To develop a CRT, test items, objectives, and score interpretations
must be formulated and generated.

Formulating and generating objectives. One of the basic features of
CRTs is their foundation on a clearly defined set of educational tasks
and purposes. CRT objectives can be selected in at least five ways:

1. Consensus judgment. Various groups such as community
   representatives, curriculum experts, teachers, and/or school
   administrators decide which educational tasks and purposes they
   consider to be the most important to measure (22, 31).

2. Curriculum analysis. A team of curriculum experts analyzes a set of
   curriculum materials in order to identify and, where necessary,
   infer the educational tasks and purposes that are the focus of the
   test (1).

3. Expert analysis of the subject area to be tested. An in-depth
   analysis is made of an area, such as mathematics ...


Each of the eight components represents a separate section of the CRT
rating form and is described below. (Complete copies of the form and
rating instructions are provided in the second section of this paper.)
On the form, weighted items are printed in italics. Weights for
classroom purposes are in parentheses and for evaluation purposes are
outside the parentheses. The basic rule for applying weights is that
when scores are computed by summing weights, high scores indicate
better CRTs. To make the scores as meaningful as possible, users of
the system can choose different items for weighting or change the
value of the weights.
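
The weighting rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the item names, the weight values, and the names `RatedItem` and `total_score` are hypothetical and do not come from the rating form itself. The sketch assumes that a weight is credited only when the reviewer judges the CRT to satisfy the item, and that each item carries one weight for classroom purposes and one for evaluation purposes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RatedItem:
    """One weighted item on a (hypothetical) CRT rating form."""
    name: str
    classroom_weight: int   # weight shown in parentheses on the form
    evaluation_weight: int  # weight shown outside the parentheses
    satisfied: bool         # reviewer's judgment for the CRT under review


def total_score(items, purpose):
    """Sum the weights of satisfied items for the chosen purpose.

    Higher totals indicate better CRTs for that purpose.
    """
    if purpose not in ("classroom", "evaluation"):
        raise ValueError("purpose must be 'classroom' or 'evaluation'")
    total = 0
    for item in items:
        if item.satisfied:
            if purpose == "classroom":
                total += item.classroom_weight
            else:
                total += item.evaluation_weight
    return total


# Illustrative ratings only; these items and weights are invented.
ratings = [
    RatedItem("objectives are listed", 2, 2, True),
    RatedItem("can be administered in groups", 1, 3, True),
    RatedItem("curriculum referencing provided", 3, 0, False),
    RatedItem("available at many grade levels", 1, 3, True),
]

classroom_total = total_score(ratings, "classroom")
evaluation_total = total_score(ratings, "evaluation")
print(classroom_total, evaluation_total)
```

Because the weights are data rather than logic, a user who prefers different weighted items or weight values changes only the list of ratings, which mirrors the flexibility the rating system allows.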



Component 1: Marketing and Packaging

The first concern of the rating system is with the scope of the
entire CRT across all grade levels, that is, with the content and
skills it assesses and the grade or achievement levels at which forms
of the CRT are available. Because program evaluations frequently
involve longitudinal data collection and/or several different grade
levels, CRTs that are available at many levels are particularly
valuable in an evaluation context.

The next concern is with the way in which the CRT at a particular
grade or achievement level is organized. A CRT's format and
organization are usually determined by its intended function(s)
relative to the various kinds of constraints imposed on its
development and use. For example, CRTs designed as classroom aids for
individualized instruction programs would have, at each grade level,
many short tests each attending to a specific objective or cluster of
objectives. On the other hand, CRTs designed for use in program
evaluations would have fewer tests that measure more general
objectives. A major feature of any CRT is that test items are designed
to measure specific objectives. Consequently, it is important that the
objectives be listed. The flexibility to select objectives and test
items varies considerably among CRTs. Some CRTs offer a bank or pool
of items, each referenced to an objective, from which users can create
their own tests. Conversely, some CRTs offer only one pre-formatted
test per grade level. Also, some CRTs have two parallel or alternate
forms of each test, while others do not.

Still another concern involves the materials included as standard or
optional features of the CRT. The materials that are offered as part
of the CRT vary considerably from publisher to publisher and can range
from just a collection of tests to a system replete with
audio-cassette equipment, test copies on spirit masters, and a host of
resource guides. In addition to the basic CRT package, inservice
training programs sometimes may be obtainable from the publisher and
so may be other support services such as record keeping and
computerized scoring systems.

Cost factors must also be considered when discussing the marketing and
packaging of a CRT system. The cost of purchasing and administering
the test must be affordable. Finally, the CRT materials must be of
acceptable physical quality.



Component 2: Examinee Appropriateness

The second component of the rating system deals with the
appropriateness of the CRT's test items, instructions, format, timing,
and procedures for recording answers for examinees at the achievement
or grade level designated by the publisher. In particular, the tasks,
vocabulary, and level of reading required by the CRT's test items must
be matched to examinees' educational experience and maturity.
Similarly, instructions should be unambiguous and easily understood,
and the CRT's format (the organization of printed materials on a
page), illustrations and print, and auditory presentations (cassettes)
must be suitable for those being tested. Finally, the timing and
pacing of tests and the procedures for recording answers also must be
tailored to the examinees.



Component 3: Administrative Usability

How useful are CRTs in terms of the ease with which they are
administered, adapted, scored, and interpreted and their value in
making educational decisions?

One factor strongly affecting a CRT's utility is the training
necessary to administer the test properly. Since few schools have a
staff that includes resident psychometrists, developmental
psychologists, audiologists, or speech therapists, and since it is not
feasible to contract for these professionals' services each time a
student is tested, a CRT intended for use in a classroom context has
greater utility if it can be administered by the school's regular
staff and preferably by the student's teacher, by a paraprofessional,
or by the students themselves. On the other hand, this issue is not as
crucial to CRTs intended for use in a program-evaluation context,
since most evaluators are trained in the administration of cognitive
and psychological test batteries.

Another factor closely related to test administration is the number of
examinees that can be tested in a single group. In general, CRTs that
have capabilities for both group and individual administration seem to
be most practical. However, for individualized instruction, CRTs that
can be taken individually are essential, and for large-scale
evaluations, CRTs that can be administered in groups are more
desirable and cost effective.

The administrative usability of a CRT is also affected by the time
necessary for its administration. The average attention span does not
generally extend beyond 20 minutes for young children and one class
period for more mature students. In addition, equipment and materials
involved in the test and the simplicity or complexity of directions
can influence the ease with which a CRT is administered.

The order in which the individual tests that comprise the CRT must be
administered has important consequences for a CRT's administration.
For example, CRTs that require a prescribed order for testing have
limited usefulness with curriculums that follow another sequence.

The ease of the scoring procedure also affects the usability of a
CRT. Simple and objective hand- or machine-scoring of tests is
generally considered more desirable than difficult and subjective
scoring systems. Although a CRT's usefulness may not be altered to any
perceptible degree by slight variations in scoring difficulty, tests
scored on a purely subjective basis are not recommended for use in
large-scale evaluations.

From a pragmatic viewpoint, while ease of administration, adaptation,
and scoring are desirable in a CRT, a much more basic consideration is
that the scores obtained be susceptible to meaningful interpretation.
The availability of interpretation guides is considered necessary to
guarantee correct and consistent interpretation of CRT scores. Scoring
systems or scales that are commonly used, generally understood, and
that require few mathematical conversions are desirable. Similarly,
scores interpretable by school staff, parents, and students are
preferred to those demanding the skills of psychometrists or other
specialists.

The final issue related to a CRT's administrative usability is the
extent to which the test can be used to make educational decisions.
Sometimes CRTs are accompanied by guidelines to translate test results
into educational decisions. When used in a program evaluation context,
the CRT results should permit the identification of successful and
unsuccessful programs, and when used as a classroom resource, the CRT
results should be able to assist teachers in assessing a student's
progress and in selecting the next units of instruction. A strategy
that appears to have promise in this latter regard is the referencing
of objectives and test items to specific instructional materials. This
strategy, often called "curriculum referencing," guides students and
teachers to the appropriate materials for additional and/or
supplemental instruction.



Component 4: Function and Purpose

CRTs can be used by teachers as one of their regular classroom
resources in individualizing and evaluating instruction. In this
classroom-management context, CRT results can be used to diagnose
problems related to students' specific learning objectives; to place
examinees with respect to an instructional program; to measure
individuals' achievement or progress; and to assess overall learning.
In an evaluation context, CRTs can be used to measure achievement, to
assess the merit of an instructional program, and/or to compare
programs. Some CRTs are recommended by the publisher for use in a
variety of contexts; others are intended for use in just one.

Component 5: Objectives Development

The issues related to the fifth component of the rating system include
the specification of domains, the characteristics of objectives, and
CRTs' match to instructional programs.

One of the basic features of CRTs is their foundation on a clearly
defined set of educational tasks and purposes which together
constitute the CRT's domain.* CRT objectives can be selected or
defined in at least four ways:

1. Expert judgment. Experts assess, on the basis of their knowledge
   and experience in the field, the educational tasks and purposes
   that are the most important to measure.

2. Consensus judgment. Various groups such as community
   representatives, curriculum experts, teachers, and/or school
   administrators decide which educational tasks and purposes are the
   most important to measure (16, 22).

3. Curriculum analysis. A set of curriculum materials is analyzed in
   order to identify and, where necessary, infer the educational tasks
   and purposes that should be the focus of the CRT (3).

4. Theories of learning and instruction. A literature review is
   conducted and/or consultants called in to formulate series or
   hierarchies of educational tasks and purposes based upon the
   results of psychological theory and research (13).

*The set of educational tasks and purposes that a CRT measures is
sometimes called a domain or universe of content (21, 5). However, the
term domain is used by others to mean the rules for generating test
items to measure a specific objective (11). Throughout this paper, the
first meaning will be used.

No matter how they are derived, educational tasks and purposes are
usually called objectives or behavioral objectives. However, it should
be noted that these terms have a precise meaning to educators: "An
objective is an intent (author's italics) communicated by a statement
describing a proposed change in a learner -- a statement of what the
learner is to be like when he has successfully completed a learning
experience" (16). Developers of CRTs do not always use this definition
in its purest sense. To them, an objective refers to the content that
is supposed to have been learned (equivalent and nonequivalent sets in
sixth-grade math, for example) and sometimes includes the behaviors
the student is supposed to exhibit (naming the first five presidents
of the U.S.A.).

The set of objectives or domain measured by a CRT can be characterized
in terms of its organization: It can be presented without any
structure, it can be organized according to major skill areas assessed
by the CRT, or it can be further structured in terms of hierarchies of
tasks within skill areas. Whatever organization scheme is used, it
should clearly demonstrate the skeleton of the domain to be measured.

Objectives can also be characterized in terms of the rules used to
write them and how broadly or narrowly they are stated. Formal rules
for generating and stating objectives are needed to ensure the
uniformity, manageability, and comprehensiveness of the set of
objectives that the CRT measures. The level of generality at which
objectives are stated is affected by the size of the domain covered by
the CRT. It is possible to cover a domain by a small number of very
generally stated objectives; however, objectives so stated may be
ambiguous. On the other hand, detailed objectives can cover a domain
in less ambiguous terms, but achieving this kind of clarity
necessitates generating and stating a sizable and possibly unwieldy
number of objectives.

Another concern closely related to domain development is the match
between the CRT's objectives and those of an instructional program or
curriculum. A CRT's match to a curriculum reflects the extent to which
it has been designed for use with a specific educational program
(2, 20). CRTs with an extensive match to a curriculum have objectives
and test items that are dependent on a particular curriculum or set of
educational materials, while CRTs with some match to a curriculum, on
the other hand, have objectives and test items that are only sometimes
dependent on the specific tasks or purposes of an educational program.
Conversely, CRTs with no match to a curriculum are based on a domain
of tasks and purposes that are independent of any educational program.
In a classroom context, it is generally desirable for the CRT to match
the curriculum being used, while in an evaluation context, in order to
be fair to all educational programs, it is usually preferred that the
CRT be independent of any curriculum.



Component 6: Item Development

Once the purposes and objectives for a CRT system have been
delineated, the next step is to construct and/or select tasks or test
items to measure those objectives. This is one of the most difficult
steps in the total test development process because there are a vast
number of test items that might be constructed or generated for any
given objective, even for those that have relatively narrow
definitions.

Since each test item must be linked to an objective, a question arises
about the number of test items that should be constructed for each
objective. Some of the factors affecting the answer are the amount of
testing time available and the cost of making possible interpretation
errors (such as saying that a student has achieved mastery when he or
she has not). More items are needed for some objectives than for
others to obtain a stable estimate of learners' performance. Moreover,
a set of test items that samples the range of behaviors and contents
associated with an objective is more likely to give an accurate
assessment of an examinee's performance than would a more restricted
set of test items.
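
The trade-off between the number of items per objective and the risk of an interpretation error can be illustrated numerically. The sketch below is not taken from the report; it assumes a simple binomial model in which every item for an objective is answered correctly with the same probability (`true_p`), independently of the other items. The function name `prob_reaching_cutoff` and the cutoffs used are hypothetical choices for illustration.

```python
from math import comb


def prob_reaching_cutoff(n_items, cutoff, true_p):
    """Probability of answering at least `cutoff` of `n_items` correctly.

    Assumes independent items, each answered correctly with
    probability `true_p` (a binomial model, adopted for illustration).
    """
    return sum(
        comb(n_items, k) * true_p**k * (1 - true_p) ** (n_items - k)
        for k in range(cutoff, n_items + 1)
    )


# A learner who can answer 60% of the universe of items is tested
# against an 80% mastery criterion with a short and a longer test.
false_mastery_short = prob_reaching_cutoff(5, 4, 0.6)    # 4 of 5 items
false_mastery_long = prob_reaching_cutoff(20, 16, 0.6)   # 16 of 20 items
print(round(false_mastery_short, 5), round(false_mastery_long, 5))
```

Under these assumptions the chance of wrongly declaring mastery is much smaller with twenty items than with five, which is one way to see why more items may be needed when interpretation errors are costly.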

Some of the strategies and procedures used to construct test items
include:

1. Panels of experts. A group of measurement and curriculum experts
   decide which items to use based on their knowledge of, and
   experience in, the field.

2. A content-process matrix. Basically a variation on the classical
   test-construction technique, this approach involves developing for
   each objective a matrix of the contents and behaviors to be
   assessed. Items are then systematically sampled from the cells of
   the matrix and perhaps along a third continuum of item difficulty
   as well (22).

3. Systematic item generation. Basic "item forms" or specifications
   are developed for each objective that define the range of item
   difficulties, all the relevant contents and behaviors, and stimulus
   and response characteristics of items that can be used to assess
   the objective (10, 11, 5, 20, 19).
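
As a rough illustration of the content-process matrix strategy listed above, the following sketch draws the same number of items from every cell of a small matrix. The cell labels, item identifiers, and the function name `sample_from_matrix` are invented for the example, and random sampling within each cell is an assumption; the report says only that items are "systematically sampled" from the cells.

```python
import random


def sample_from_matrix(matrix, per_cell, seed=0):
    """Draw up to `per_cell` items from each (content, behavior) cell.

    `matrix` maps (content, behavior) tuples to lists of item labels.
    A seeded generator keeps the draw reproducible.
    """
    rng = random.Random(seed)
    chosen = []
    for cell in sorted(matrix):
        pool = matrix[cell]
        k = min(per_cell, len(pool))  # a cell may hold fewer items
        chosen.extend(rng.sample(pool, k))
    return chosen


# Hypothetical matrix for one objective: contents by behaviors.
matrix = {
    ("fractions", "compute"): ["F-C1", "F-C2", "F-C3"],
    ("fractions", "apply"): ["F-A1", "F-A2"],
    ("decimals", "compute"): ["D-C1"],
    ("decimals", "apply"): ["D-A1", "D-A2", "D-A3"],
}

selected = sample_from_matrix(matrix, per_cell=2)
print(len(selected))
```

Sampling by cell guarantees that every combination of content and behavior is represented, which a single undifferentiated item pool would not ensure. A third dimension for item difficulty could be added by extending the cell key.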

The procedures used to guide item writing can have a direct bearing on
the utility, validity, and score interpretations of CRTs. For example,
CRT systems that use specific guidelines for item construction are
more likely to measure all the relevant skills and behaviors being
assessed than those that do not. Moreover, specific guidelines permit
the development of additional parallel test items if they are desired.
Without the guidance of a systematic plan, it is very easy to
construct or generate items for those aspects most amenable to
measurement rather than those that might be most germane or critical.
It also seems likely that responsible test developers working with an
overall plan are more apt to focus their attention on the most salient
(and perhaps most frequently taught) facets of an objective rather
than include those components that may only be tangential to student
learning. No matter what strategy is used to construct CRT items,
guidelines for item writing should include comprehensive rules for the
specification of tasks, conditions, and content for test items.

To what degree should items be sampled to compare their relative
difficulty and possible content coverage? It is a well-known and
frequently used principle of test construction that even slight
changes in an item can affect its difficulty. The extent to which the
items are sampled with respect to difficulty has a direct bearing on
the interpretation of the scores obtained. In other words, if only the
most difficult items are used to measure an objective, the phrase
"achievement of the objective" will have a very different meaning than
if the items are sampled over the full range of difficulties.

An issue related to item writing, and one which has perhaps not
received as much attention as it should, is the potential interaction
between the objective and how it is measured. It is often assumed, for
example, that selected response items (such as multiple-choice
questions) serve as an effective proxy for constructed response items
(such as completion or short-answer questions) because the performance
of students on the two kinds of items is highly related. Although this
may be generally true, it may not be true for certain kinds of
objectives. Further, the degree of mastery required to answer a
constructed response item is usually greater than that needed to
answer a selected response item. Despite the obvious advantages of the
former format, the ease of scoring items using a selected response
format has led to its almost exclusive use in published measures,
including CRTs.



Component 7: Methods of Score Interpretation

One of the distinctive features of a CRT is its ability to provide a
means for describing what an individual or group can do, knows, or
feels without having to consider the skills, knowledge, or attitude of
others. Consequently, CRT scores are reported and interpreted in terms
of the level of performance obtained with respect to the objective(s)
or domain on which the CRT is based. This type of score reporting is
very different from that used for norm-referenced tests in which
scores are reported in terms of the performance of other individuals
or groups.

Criterion-referenced score interpretations can be expressed
in several ways. CRT scores can be reported as the
percentage of individuals who correctly answered each
item (the item's difficulty). This score is used primarily
when only one item is tested per objective, as is the case,
for example, with the National Assessment of Educational
Progress. Reporting an individual's or group's actual level
of performance as the percentage or number of items
correctly answered for each objective is another very common
way of expressing CRT scores. An empirical variation of
this score is the estimated "true" level of performance,
referring to the portion of the total universe of items for an
objective that an individual or group could answer correctly.
Mastery interpretation schemes report scores in
terms of whether or not a pre-set performance level has
been achieved, and an individual is described as having or
not having mastered a given objective. Selection of the
criterion level of performance for a mastery score interpretation
should not be arbitrary, but should be justifiable
and based on a concept of mastery. Experience, theories of
learning, or experimentation can be used to justify a concept
of mastery. Nonarbitrary definitions of mastery have
been offered by Novick and Lewis (17), Harris (8), and
others. In some of these mastery systems, several categories
are employed to distinguish between degrees or
levels of mastery. CRT scores also can be reported in terms
of the level of performance achieved after a certain amount
of learning time or as the probability of passing the next
unit of instruction.
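
The first three reporting schemes described above (item difficulty, percentage of items correct per objective, and a mastery classification) can be illustrated with a minimal Python sketch. The response data, the objective labels "A" and "B", and the 0.75 cutoff are all hypothetical values chosen for illustration; the report itself does not prescribe a cutoff and stresses that one should be justified rather than arbitrary.

```python
from typing import Dict, List

# Hypothetical data: one row per examinee, one 0/1 entry per item.
# Items 0-3 are assumed to measure objective "A", items 4-7 objective "B".
RESPONSES: List[List[int]] = [
    [1, 1, 1, 0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 0, 1],
    [0, 1, 0, 0, 0, 0, 0, 1],
    [1, 1, 1, 1, 1, 1, 1, 1],
]
OBJECTIVE_ITEMS: Dict[str, List[int]] = {"A": [0, 1, 2, 3], "B": [4, 5, 6, 7]}
MASTERY_CUTOFF = 0.75  # assumed criterion level, for illustration only


def item_difficulty(rows: List[List[int]]) -> List[float]:
    """Proportion of examinees who answered each item correctly."""
    n = len(rows)
    return [sum(row[j] for row in rows) / n for j in range(len(rows[0]))]


def percent_correct(row: List[int], items: List[int]) -> float:
    """Proportion of one objective's items that one examinee answered correctly."""
    return sum(row[j] for j in items) / len(items)


def mastery_report(row: List[int], objectives: Dict[str, List[int]],
                   cutoff: float) -> Dict[str, bool]:
    """True for each objective on which the examinee reaches the cutoff."""
    return {name: percent_correct(row, items) >= cutoff
            for name, items in objectives.items()}


if __name__ == "__main__":
    print(item_difficulty(RESPONSES))
    print(mastery_report(RESPONSES[0], OBJECTIVE_ITEMS, MASTERY_CUTOFF))
```

The estimated "true" score and the learning-time and next-unit schemes require a statistical model and are not attempted here.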

Scores on CRTs need not be limited to just a CRT interpretation.
Other score interpretations can also be provided to
expand upon the CRT interpretation (14, 1, 6). For example,
criterion-referenced information can be combined with
norm-referenced information in the following way: "This
school had an average score of 5 out of 10 on the objective
(a CRT interpretation) which is one standard deviation
below the national average of 7 out of 10 (a norm-referenced
interpretation)." The idea of using both types of
score interpretations is not new and it does not reduce the
theoretical soundness of the score interpretation (4, 14,
15). Combining score interpretations is useful for describing
what a student can be expected to do and how exceptional
or typical this performance is. Comparative or norm-referenced
scores are typically reported in terms of standard
score scales, age/grade equivalents, and percentiles.
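
The combined interpretation quoted above amounts to reporting a raw objective score together with its standard score. The sketch below is an illustration only: the function names are invented, and the norm-group standard deviation of 2 is inferred from the example (a score of 5 is said to be one standard deviation below an average of 7), not stated in the report.

```python
def standard_score(score: float, norm_mean: float, norm_sd: float) -> float:
    """Distance of a score from the norm-group mean, in standard deviation units."""
    if norm_sd <= 0:
        raise ValueError("norm_sd must be positive")
    return (score - norm_mean) / norm_sd


def combined_report(score: float, max_score: float,
                    norm_mean: float, norm_sd: float) -> str:
    """Pair a criterion-referenced statement with a norm-referenced one."""
    z = standard_score(score, norm_mean, norm_sd)
    if z == 0:
        position = "at the norm-group average"
    else:
        direction = "below" if z < 0 else "above"
        position = f"{abs(z):.1f} SD {direction} the norm-group average"
    return f"{score:g} out of {max_score:g} on the objective; {position} of {norm_mean:g}"


if __name__ == "__main__":
    # Values from the example in the text; the SD of 2 is an inferred assumption.
    print(combined_report(5, 10, 7, 2))
```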

The type of scheme used to report scores is, in part,
determined by the context in which a CRT is used. Reporting
results as the percentage of items passed per objective
can be meaningful in a classroom context if the objectives
are carefully matched to the curriculum. However, in an
evaluation context, this type of interpretation alone may
be inadequate because it provides insufficient information
for decision making and loses meaning outside the classroom.
For evaluation purposes, it is probably useful to
supplement this score with comparative data or to use
scores whose criterion levels have been validated empirically
or based on theories of instruction and learning.

Component 8: Analysis and Validation

It is axiomatic that all tests and measures must be validated
before basing decisions upon them. This process can
involve giving the test to students and studying their responses
(response data) or relying upon review by experts
(judgmental data). The issues addressed in this component
of the CRTDE involve both of these processes and include
the characteristics of field tests conducted to certify CRTs
and dimensions of item and test quality.

A CRT should be field tested on a sample of individuals
who are representative of those for whom it was intended.
Since most commercially published CRTs are intended for
widespread use, they should be tried out in a large-scale
field test with samples that are geographically and
ethnically representative of the nation.

There is much ambiguity about the procedures appropriate
for analyzing the data from CRT field tests. Nevertheless,
there are several dimensions of item and test
quality that are considered to be relevant to CRTs and that
have associated with them review procedures, data collection
strategies, experimental designs, and statistical
indexes. In recognition of the uncertainty in the field with
respect to the psychometric characteristics of a good CRT
and the methods for measuring their presence/absence,
the CRTDE system includes only the dimensions of CRT quality
that are attended to by test publishers.*

There are five dimensions that can be used to assess item 
quality. They are: 

1. Item-objective congruence. A test item is considered
"good" if it measures or is congruent with the objective
that it is supposed to assess. Item-objective congruence
can be established by using judgmental data.
Typically, content experts are given a variety of objectives
and the items used to measure them and are
asked to comment on the appropriateness of the item-objective
relationship.

2. Equivalence (internal consistency). An item is considered
"good" if it "behaves" like other items that
measure the same objective. The concept is similar to
item-objective congruence, but its proper use depends
on response data. Equivalence is usually measured by
computing the biserial correlation between the score
on an item and the total score on all items measuring
that objective. It should be noted that for broadly
defined objectives, internal consistency will be lower
than for narrowly defined objectives, and this must
be taken into account when using internal consistency
data to make decisions about item quality.

3. Stability (over time). An item is considered "good" if
examinee performance is consistent from one test
period to the next in the absence of any special intervention
(such as instruction, which is an intervention
that can change examinee performance). Stability involves
response data and can be measured with the
phi coefficient.

4. Sensitivity to instruction. An item is considered
"good" if it is sensitive to instruction, that is, if the
item is able to discriminate between those who have
and those who have not benefited from instruction,
assuming that instruction was adequate. This measure
of item quality is usually computed for CRTs that
are linked to particular educational programs
and requires response data. Typically, examinees are
tested before and after an educational program, and
those items that many examinees fail before instruction
but pass after instruction are considered to be
sensitive to the instruction.

5. Cultural/sex bias. An item is considered "good" if it
leads to accurate inferences about the knowledge,
skills, or other attributes of an individual or group.
Bias can be assessed using either judgmental or response
data. If the former are used, representatives
of different cultural groups, members of each sex,
and/or linguists examine test items to determine
whether vocabulary or content could be misinterpreted.
If response data are used to assess bias, they
are analyzed (typically, using ANOVA or regression) for
item-cultural/sex interactions.

*Several of these dimensions (for example, item and test estimates of
equivalence and stability) depend upon the variability of performance in
the field test sample. To be meaningful, the sample must be representative
of the population of interest and contain sufficient variation.
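
Three of the response-data indexes named in the list above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the procedure of any particular publisher. For equivalence it uses the point-biserial (Pearson) item-total correlation as a simpler stand-in for the biserial correlation named in the text. For stability it uses the phi coefficient, which equals the Pearson correlation between two 0/1 variables. For sensitivity to instruction it uses a plain pre/post difference in the proportion passing. All function names are hypothetical.

```python
from math import sqrt
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable has no variation")
    return sxy / sqrt(sxx * syy)


def item_total_correlation(item: Sequence[int], totals: Sequence[float]) -> float:
    """Equivalence: point-biserial correlation of a 0/1 item with the objective total."""
    return pearson(item, totals)


def phi_coefficient(first: Sequence[int], second: Sequence[int]) -> float:
    """Stability: phi between 0/1 scores on the same item at two test periods."""
    return pearson(first, second)


def sensitivity_index(pre: Sequence[int], post: Sequence[int]) -> float:
    """Sensitivity: proportion passing after instruction minus proportion passing before."""
    return sum(post) / len(post) - sum(pre) / len(pre)


if __name__ == "__main__":
    print(item_total_correlation([1, 1, 0, 0], [4, 3, 1, 0]))
    print(phi_coefficient([1, 1, 0, 0], [1, 1, 0, 0]))
    print(sensitivity_index([0, 0, 1, 0], [1, 1, 1, 0]))
```

The `ValueError` raised when every examinee has the same score mirrors the footnote's point: these indexes are meaningless unless the field test sample contains sufficient variation.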

There are six dimensions commonly used to express the
quality of a CRT. They are:

1. Test-objective congruence. Similar to item-objective
congruence, test-objective congruence assesses the
extent to which the total test or subtest measures the
relevant objective. Test-objective congruence is
usually determined by using judgmental data.

2. Equivalence (internal consistency). Test equivalence
measures the homogeneity of test items for an objective,
that is, how coherently the test items assess a
particular objective. Equivalence can be estimated by
using split-half correlations, Kuder-Richardson
formulas, or coefficient alpha. It should be noted that
internal consistency estimates will be lower for more
broadly stated objectives.

3a. Stability (test-retest or alternate forms). A test is
stable to the extent that examinee responses are consistent
from one test period to another or across alternate
forms of a test in the absence of any intervention.

3b. Stability (number of items per objective). A determination
is made of the number of items that should
be tested in order to obtain a stable score on an objective.
For this type of stability, the assumption is
made that for each objective there is a pool or population
of items with mixed difficulties that deals with
the objective and that for any given test, a sample of
those items is selected. Stability can be estimated
with response data using correlation techniques and/
or Bayesian models (17).

4. Sensitivity to instruction. This measure of test quality
is usually obtained for CRTs that are linked to a specific
educational program. It can be obtained from
response data by comparing scores of those who have
and have not received instruction.

5. Cultural/sex bias. Bias in measurement occurs when
characteristics of the test, the testing process, or the
interpretation of test results lead to inaccurate inferences
about the knowledge, skills, or other attributes
of individuals or groups (1). Bias can be measured by
ANOVA or regression techniques using response data
or by expert review using judgmental data.



6. Criterion validity. Criterion validity establishes the
meaningfulness of the criterion in terms of which CRT
scores are interpreted. Establishing criterion validity
makes use of classical validity measures and is either
a one-step or a two-step process:

Step 1: The first step involves assessing the meaningfulness
or content validity of the domain: that
objectives have been selected and organized to be in
themselves educationally significant and that test
items have been systematically generated to cover
the objectives. Step 1 is usually established by having
experts review the objectives and test items to
determine the extent to which they were developed
in conformance with prespecified procedures and to
which they cover the domain in a comprehensive
and meaningful manner.

Step 1 must be completed for all CRTs and, in some
cases, is sufficient for establishing criterion validity.
One example of a test that requires only Step 1 criterion
validity is a CRT based on objectives that are
narrowly defined and operationally stated in such
detail that generating items requires only transposing
the objectives into question form. CRT score interpretations
of objectives with these characteristics
are meaningful because the objectives describe skills
that can be measured directly by test items. In a
second case, the CRT's objectives are linked to a curriculum
and interpreted by teachers or curricular
experts. CRT score interpretations are meaningful for
these objectives because the skills and knowledge
measured are those taught in classrooms using a
specific curriculum. A third case in which Step 1
validity is sufficient is when comparative data are
provided or when the CRT score interpretation for
each objective is supplemented by a normative
interpretation.

Step 2: In Step 2, criterion validity is established
through empirical means and involves determining
whether examinees who perform well on the test
have really achieved the educational objective. Step
2 criterion validity can be measured by comparing
scores obtained by individuals who, in advance of
taking the CRT and using independent criteria, are
judged to possess or not possess the skills that the
objective is intended to measure. To the extent that
the CRT discriminates between these two groups of
individuals, the CRT has criterion validity. (Note
that if an objective or domain is considered analogous
to a psychological state, then Step 2 criterion validity
can be likened to construct validity; otherwise, Step
2 criterion validity can be likened to concurrent
validity.)

By establishing Step 2 criterion validity, the relationship
between test items and the objectives they
are supposed to measure is empirically confirmed.
Step 2 criterion validity permits assertions about
mastery of the individual objectives that comprise a
domain and about more complex behaviors whose
component parts are defined by the domain.
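
Two of the test-level checks described above lend themselves to a short worked sketch. The first function computes Kuder-Richardson formula 20, one of the Kuder-Richardson formulas mentioned under equivalence, for a set of 0/1 items measuring one objective. The second compares pass rates for examinees independently judged to possess or not possess the skill, in the spirit of Step 2 criterion validity. The data layout and function names are assumptions made for illustration; the report does not specify a computational procedure.

```python
from statistics import pvariance
from typing import List, Sequence, Tuple


def kr20(rows: List[List[int]]) -> float:
    """Kuder-Richardson formula 20 for 0/1 item scores (one row per examinee)."""
    n, k = len(rows), len(rows[0])
    # Proportion passing each item, and the sum of item variances p * (1 - p).
    p = [sum(row[j] for row in rows) / n for j in range(k)]
    item_variance_sum = sum(pj * (1 - pj) for pj in p)
    # Population variance of examinees' total scores.
    total_variance = pvariance([sum(row) for row in rows])
    if total_variance == 0:
        raise ValueError("KR-20 is undefined when total scores do not vary")
    return (k / (k - 1)) * (1 - item_variance_sum / total_variance)


def group_pass_rates(passed: Sequence[int],
                     judged_master: Sequence[bool]) -> Tuple[float, float]:
    """Pass rate among judged masters and among judged non-masters."""
    masters = [p for p, m in zip(passed, judged_master) if m]
    others = [p for p, m in zip(passed, judged_master) if not m]
    return sum(masters) / len(masters), sum(others) / len(others)


if __name__ == "__main__":
    # Hypothetical responses of five examinees to four items on one objective.
    data = [[1, 1, 1, 1], [1, 1, 1, 0], [1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 0]]
    print(kr20(data))
    print(group_pass_rates([1, 1, 0, 1, 0, 0],
                           [True, True, True, False, False, False]))
```

A large gap between the two pass rates indicates that the test discriminates between the two judged groups; how large a gap is acceptable is a judgment the sketch does not make.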



APPENDIX 



RATING FORMS AND INSTRUCTIONS 



The CRTDE rating system has been designed so that school personnel, test-selection
committees, researchers, and professional evaluators can prepare a description
and evaluation of criterion-referenced tests. In the following section of this paper,
the rating form and instructions for its use are presented.



1: Marketing and Packaging*

1. SCOPE (of Total CRT)
   Grade Levels Tested (total CRT)     K 1 2 3 4 5 6 7 8 9 10 11 12
   Grade Level Coverage (total CRT)    0(0) □ Doesn't cover needed grades
                                       0(1) □ Covers needed grades

2. TEST ORGANIZATION
   Number of Separate Tests            2(0) □ 1 test
                                       0(1) □ 2-9 tests
                                       0(2) □ 10 or more tests
                                            □ Not applicable, no pre-formatted tests
   List of Objectives                  □ No    □ Yes
   Flexibility to Select Items         □ No    □ Yes, some    □ Yes, extensive
   Answer Sheet Format                 □ Hand scoreable only    □ Machine scoreable only
                                       □ Hand or machine scoreable
   Test Length (range)                 □ ___ to ___ objectives per test
                                       □ ___ to ___ items per objective
   Alternate Forms                     □ No    □ Yes

3. AVAILABLE MATERIALS
   Tests                               □ No    □ Yes, standard    □ Yes, optional
   Technical Manual                    □ No    □ Yes, standard    □ Yes, optional
   User's Manual/Guides                □ No    □ Yes, standard    □ Yes, optional
   Answer Sheets                       □ No    □ Yes, standard    □ Yes, optional
   Cassettes/Special Equipment         □ No    □ Yes, standard    □ Yes, optional
   Student Report Forms                □ No    □ Yes, standard    □ Yes, optional
   Resource Books                      □ No    □ Yes, standard    □ Yes, optional

4. OTHER PUBLISHER SERVICES
   Inservice Training Available        □ No    □ Yes, standard    □ Yes, optional
   Scoring of Tests                    □ No    □ Simple scores only
                                       □ Extensive score summaries

5. COSTS
   Cost per Student at a Given Grade Level    ________

6. QUALITY OF MATERIALS
   Physical Quality of Materials       □ Poor    □ Good    □ Very good

*Information sources for each component of the CRTDE may be found in the CRT's test booklets, examiner's manual, and/or technical manual. It should be noted that not all CRTs have
these three parts nor is information similarly organized from publisher to publisher.



2: Examinee Appropriateness

1. TEST ITEMS
   Study of Test Items' Appropriateness    □ Not reported    □ Expert judgment    □ Response data
   Vocabulary, Brevity, Clarity            □ Inappropriate   □ Appropriate
   Tasks Required of Examinees             □ Inappropriate   □ Appropriate

2. INSTRUCTIONS
   Vocabulary, Brevity, Clarity            □ Inappropriate   □ Appropriate
   Illustrative Sample Items               □ Not present     □ Present but not clarifying
                                           □ Effective and clarifying

3. FORMAT
   Test Page Layout                        □ Complicated     □ Clear
   Illustrations and Print                 □ Unclear         □ Clear
   Auditory Presentation                   □ Garbled         □ Clear

4. TIMING
   Timing and Pacing                       □ Inappropriate   □ Appropriate    □ No guidelines

5. RECORDING ANSWERS
   Response Scheme                         □ Complicated     □ Simple



3: Administrative Usability

1. ADMINISTRATION
   Size of Testing Group         0(1) □ Individual only    2(1) □ Groups only
                                 2(2) □ Individual or group
   Administrator                 0(0) □ Specialist    1(1) □ School personnel
                                 1(2) □ Self-administration
   Administration Time           □ Shortest possible time    □ Longest possible time
                                 □ No time limits
   Directions                    □ Not available    □ Available, incomplete
                                 □ Available, complete
   Equipment                     □ Special equipment needed    □ No special equipment needed
   Order for Testing             □ Prescribed sequence    □ General guidelines    □ None

2. ADAPTABILITY
   Modification of CRT by User   □ No    □ Yes, limited    □ Yes, extensive

3. SCORING
   Ease of Scoring               □ Subjective    □ Objective

4. INTERPRETATION
   Score Interpreter             □ Specialist    □ School personnel    □ Self-interpreting
   Score-Interpretation Guide    □ Not available    □ Available, complicated
                                 □ Available, not complicated
   Score Interpretability        □ Complicated, unusual    □ Simple, standard

5. DECISIONS
   Decision-Making Utility       □ No guidelines    □ Yes, limited    □ Yes, extensive
   Curriculum Referencing        0(0) □ None    0(1) □ Yes, some    0(2) □ Yes, extensive



4: Function and Purpose 


Test Uses 

I . PURPOSE (CircU all that apply) 


0(]) 0 Diagnosis 

2(1) 0 Achievemertt/outcomes 

0 Other J please specify: 


(^\)n Placement 

2(0) □ Comparison of instruction 
programs 

*^ 


1(2) 0 Achievement/process 







5: Objectives Development

1. DOMAIN (OBJECTIVE) SPECIFICATION
   Domain Definition                    □ None    □ Vague, general    □ Specific
   Domain Structure                     □ Content area    □ Content/process matrix
                                        □ Objectives/item generation format

2. OBJECTIVES' CHARACTERISTICS
   Organization of Objectives           □ None    □ Simple list (no structure)
                                        □ Categories (strands)    □ Hierarchy
   Level of Generality of Objectives    □ General    □ Specific    □ Very detailed

3. MATCH TO INSTRUCTION
   Curriculum Match                     2(0) □ None    1(1) □ Some    0(2) □ Extensive








6: Item Development

1. ITEM-OBJECTIVE RELATION
   Items Coded to Objectives          0(0) □ No    2(2) □ Yes
   Number of Items per Objective      □ Not applicable (items not coded to objectives)
                                      □ Range: ___ to ___ items/objective
                                      □ Average: ___
   Scope of Coverage of Objectives    □ Poor    □ Good

2. ITEM GENERATION
   Rules for Item Writing             □ None    □ Suggestions    □ Specifications






7: Methods of Score Interpretation

1. CRITERION
   Scores Reported in Terms of Level of Performance (Check all that apply)
      0(0) □ Item difficulty
      0(0) □ Actual score (Percent of items correct)
      1(1) □ True score (Percent of objective achieved)
      0(0) □ Arbitrary mastery
      1(1) □ Empirical mastery
      1(1) □ Achievement level after a certain amount of learning time
      1(1) □ Probability of achieving next level

2. NORM-REFERENCED
   Comparative Scores Reported as well as a Criterion-Referenced Score (Check all that apply)
      1(1) □ Standard score scales
      1(1) □ Age/grade equivalents
      1(1) □ Percentiles



8: Analysis and Validation

1. FIELD TEST
   Field Test Reported           □ No    □ Yes
   Scale                         □ Small    □ Moderate    □ Large
   Scope-Geographic              □ Local    □ Regional    □ National
   Scope-Ethnic                  □ Little ethnic representation    □ Ethnic representation
   Sample Representativeness     □ No sampling plan    □ Probability sampling plan
                                 □ Special sampling (non-probability)

2. ITEM QUALITY (judgmental data)
   Item-Objective Congruence     □ Not reported    □ Reported (Give value: )
   Cultural Bias                 □ Not reported    □ Reported (Give value: )

3. ITEM QUALITY (response data)
   Sensitivity to Instruction    □ Not reported    □ Reported (Give value: )
   Equivalence (item-objective
     internal consistency)       □ Not reported    □ Reported (Give value: )
   Stability                     □ Not reported    □ Reported (Give value: )
   Cultural Bias                 □ Not reported    □ Reported (Give value: )

4. TEST QUALITY (judgmental data)
   Test-Objective Congruence
     (content validity)          0(0) □ Not reported    1(1) □ Reported (Give value: )
   Cultural Bias                 0(0) □ Not reported    1(1) □ Reported (Give value: )




5.T(^ST OUAUTY 
(response 


Scftsifiviry to Uiumaion 

Fqiitvatan'c (tntvrmt ion- 
sfu^'ttvy by (*i>}ci'!ivc} 

S'ahih'ry (tvit-rete^tl 
aUcrnatv formif 

Srahihry (mitubcrof 
}{vms pertthjecfin*} 
Cntenon VaUiUtv 

Cuttttrat Hiai 


0(0) 0 Nm reported 
0(0) O Nor reported 

0(0) UNvo reported 

(^i{^)n Not reported 

OtO) □ iWtrf refyorred 
0(0) QNotreporred 


\(2) O Re'Hmed (Gin* i-ahte. } 
2(3) Q Repitrted (Giir value. ) 

2(2) □ Reported iOive value J 

2(2) n Reported Kove value* } 

2(2) OReinyrtedfOive value: J 
2(2) O Repttr ted f Give value: J 








RATING INSTRUCTIONS FOR THE CRTDE

1. MARKETING AND PACKAGING

1. SCOPE (Scope of the total CRT system; that is, taking into
account all tests for all grade or achievement levels considered
together.)

   a) Grade Levels Tested (total CRT)
      K-12: Circle each grade level for which test materials
      are available.

   b) Grade Level Coverage (total CRT)*
      0 (0) Doesn't cover needed grades: Forms of the test
      are not available for all needed grade levels.
      2 (1) Covers needed grades: Forms of the test are
      available for each needed grade level.

2. TEST ORGANIZATION**

   a) Number of Separate Tests
      (Most CRTs are organized into a series of independent
      tests, each measuring one or a few objectives; some
      have just one test per grade level, and others have no
      preset tests, simply an item pool from which tests are
      made to order.)
      2 (0) 1 test: Only one preset test is provided at this
      achievement/grade level.
      0 (1) 2-9 tests: Two to nine preset tests are provided
      at this achievement/grade level.
      0 (2) 10+ tests: Ten or more preset tests are provided
      at this achievement/grade level.
      0 (0) Not applicable: No preset tests are provided.

   b) List of Objectives
      No: A list of the CRT objectives is not provided.
      Yes: A list of the CRT objectives is provided.

   c) Flexibility to Select Items
      No: All items for a given objective must be used.
      This is usually the case with pre-formatted tests.
      Yes, some: There is some opportunity to select the
      items that can be used to test an objective. This is
      the case with CRTs that provide detailed item-writing
      rules, since parallel items can be generated.
      Yes, extensive: There is freedom to choose items to
      test an objective. An example of this situation is a
      CRT that has an item pool from which tests are
      custom-made.

   d) Answer Sheet Format
      Hand scoreable only: Tests must be scored manually.
      Machine scoreable only: Tests must be machine
      scored.
      Hand or machine scoreable: Tests can be scored
      manually or by machine.

*Assigned weights are printed in italics. If the CRT is going to be used in an
evaluation context, use the weights outside the parentheses; if the CRT is going
to be used in a classroom context, use the weights in the parentheses.
Record the value of the weight in the box.

**Note: From this point forward, it is assumed that a CRT for a single
grade or achievement level is being reviewed.
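
The weighting convention in the first footnote (one weight for an evaluation context, a parenthesized weight for a classroom context) can be sketched as a small lookup. The weights below are the ones printed for "Number of Separate Tests" in the instructions above; the table layout, the function name, and the context labels are hypothetical and shown only to make the convention concrete.

```python
from typing import Dict, Tuple

# Each entry maps a response to (evaluation weight, classroom weight),
# mirroring the printed form "2 (0)": outside, then inside the parentheses.
WEIGHTS: Dict[str, Dict[str, Tuple[int, int]]] = {
    "Number of Separate Tests": {
        "1 test": (2, 0),
        "2-9 tests": (0, 1),
        "10+ tests": (0, 2),
        "Not applicable": (0, 0),
    },
}


def recorded_weight(item: str, response: str, context: str) -> int:
    """Return the weight to record in the box for the given context."""
    evaluation_weight, classroom_weight = WEIGHTS[item][response]
    if context == "evaluation":
        return evaluation_weight
    if context == "classroom":
        return classroom_weight
    raise ValueError("context must be 'evaluation' or 'classroom'")


if __name__ == "__main__":
    print(recorded_weight("Number of Separate Tests", "1 test", "evaluation"))
    print(recorded_weight("Number of Separate Tests", "1 test", "classroom"))
```

The sketch only looks up a single weight. The instructions here say to record the weight in the box and do not state how recorded weights are combined, so no totaling is shown.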



   e) Test Length (range)
      ___ to ___ objectives per test: Range in number
      of objectives per test
      ___ to ___ items per objective: Range in the number
      of items per objective

   f) Alternate Forms (Parallel tests that measure the
      same content using different but equivalent test
      items)
      No: Alternate test forms are not provided.
      Yes: Alternate test forms are provided.



3. AVAILABLE MATERIALS (The kinds of materials that are
provided as part of the CRT. For each item in this subsection,
a distinction is made between those materials
that are provided as standard parts of the CRT and the
materials that are optional [that can be purchased
separately].)

   a) Tests (Pre-formatted test forms)
      No: Not provided as part of the CRT
      Yes, standard: Provided as a regular part of the CRT
      Yes, optional: Not automatically included as part of
      the CRT; can be purchased separately

   b) Technical Manual (This is a report that describes the
      CRT system and the way in which it was field tested
      and analyzed. This manual should present statistical
      indexes of reliability and validity.)
      No: Not provided as part of the CRT
      Yes, standard: Provided as a regular part of the CRT
      Yes, optional: Not automatically included as part of
      the CRT; can be purchased separately

   c) User's Manual/Guides (A manual or brochure written
      for teachers and others who will administer the CRTs;
      the manuals usually include detailed instructions for
      test administration and use of the system.)
      No: Not provided as part of the CRT
      Yes, standard: Provided as a regular part of the CRT
      Yes, optional: Not automatically included as part of
      the CRT; can be purchased separately

   d) Answer Sheets (Pre-formatted forms for recording
      answers)
      No: Not provided as part of the CRT
      Yes, standard: Provided as a regular part of the CRT
      Yes, optional: Not automatically included as part of
      the CRT; can be purchased separately

   e) Cassettes/Special Equipment (Special cassettes, video
      equipment, etc., required to administer and/or score
      the CRT)
      No: Not provided as part of the CRT
      Yes, standard: Provided as a regular part of the CRT
      Yes, optional: Not automatically included as part of
      the CRT; can be purchased separately



   f) Student Report Forms (Special forms for documenting
      the progress of each student)
      No: Not provided as part of the CRT
      Yes, standard: Provided as a regular part of the CRT
      Yes, optional: Not automatically included as part of
      the CRT; can be purchased separately

   g) Resource Books (Special aids associated with the
      CRT, for example, curriculum guides that reference
      pages in texts where the CRT's objectives are covered)
      No: Not provided as part of the CRT
      Yes, standard: Provided as a regular part of the CRT
      Yes, optional: Not automatically included as part of
      the CRT; can be purchased separately

4. OTHER PUBLISHER SERVICES (Additional services available
from the publisher)

   a) Inservice Training Available (Instruction in the use
      of the CRT system)
      No: Publisher does not make available any form of
      inservice training.
      Yes, standard: Publisher offers inservice training as
      a standard part of the CRT.
      Yes, optional: Publisher offers inservice training
      which can be purchased separately from the CRT.

   b) Scoring of Tests (Scoring services that usually involve
      sending tests to the publisher for scoring or the
      rental/purchase of special computer programs for
      scoring tests)
      No: Scoring services are not available from the
      publisher.
      Simple scores only: Only individual test-scoring services
      are available (no summary information).
      Extensive score summaries: In addition to individual
      test scores, aggregated scores and other summary
      data are available.

5. COSTS (Of purchasing the CRT system from the publisher)
   Cost per Student at a Given Grade Level (Cost per
   student of the CRT at a given grade or achievement
   level is based on a minimum purchase. Thus, if the
   test is sold in lots of 35, the cost per student would be
   1/35 of the lot price.)
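
The cost rule above is simple arithmetic: divide the price of the minimum purchase by the number of students it covers. The sketch below works through it; the lot price of 70.00 is a made-up figure for illustration, and the function name is hypothetical.

```python
def cost_per_student(lot_price: float, lot_size: int) -> float:
    """Cost per student when tests are sold in lots of a fixed size."""
    if lot_size <= 0:
        raise ValueError("lot_size must be a positive number of students")
    return lot_price / lot_size


if __name__ == "__main__":
    # Hypothetical lot of 35 tests priced at 70.00 per lot.
    print(cost_per_student(70.00, 35))
```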

6. QUALITY OF MATERIALS
   Physical Quality of the Materials (A subjective evaluation
   of the CRT materials)
   Poor: Below average quality
   Good: Average quality
   Very good: Above average quality



2. EXAMINEE APPROPRIATENESS

1. TEST ITEMS

   a) Study of Test Items' Appropriateness (This includes
      investigations, conducted by the publisher, of the
      appropriateness of the test items for the intended
      examinees. Such investigations are usually documented
      in technical manuals or user's guides.)
      Not reported: An investigation of test items' appropriateness
      is not reported. (Note: mention that an
      investigation was conducted without any details
      should be rated "not reported." Some data on the
      nature of the investigation and/or its results are
      required.)
      Expert judgment: Experts' opinions are used to
      establish test items' appropriateness.
      Response data: Empirical studies (that include giving
      test items to examinees) were conducted in order
      to establish test items' appropriateness.

   b) Vocabulary, Brevity, Clarity (A subjective evaluation
      of test items' appropriateness in terms of their
      vocabulary, brevity, and clarity)
      Inappropriate: Most items are inappropriate. That
      is, the items use vocabulary that is too difficult or
      easy; the items are needlessly long; the items are
      misleading; or there is no connection between the
      item stems and answers.
      Appropriate: Most items are appropriate. That is,
      the vocabulary used matches the intended examinees'
      educational level; items are not too long and contain
      only relevant information; and there is a simple
      connection between the item stems and answers.

   c) Tasks Required of Examinees (A subjective evaluation
      of the ability of the designated examinees to
      accomplish tasks required to complete the test items.)
      Inappropriate: Most items involve tasks that are too
      easy or difficult for examinees.
      Appropriate: Most items involve tasks that examinees
      should be able to accomplish.

2. INSTRUCTIONS

   a) Vocabulary, Brevity, Clarity (A subjective evaluation
      of the instructions, either read to or by examinees,
      in terms of their vocabulary, brevity, and
      clarity)
      Inappropriate: Instructions are usually inappropriate.
      That is, the vocabulary used is too easy or too
      difficult for examinees, they are needlessly long, or
      they are misleading and confusing.
      Appropriate: Instructions are usually appropriate.
      That is, the vocabulary used matches examinees'
      educational level; they are brief and easily understood.

   b) Illustrative Sample Items (The inclusion of sample
      test items in the instructions)
      Not present: No sample items provided.
      Present but not clarifying: Sample items are provided,
      but are not representative of the items.
      Effective and clarifying: Sample items are provided
      that accurately represent the tasks required by
      the test items.






3. FORMAT (Appropriateness of the formatting of the CRT
materials)

   a) Test Page Layout (A subjective evaluation of the
      arrangement of written materials on a test page)
      Complicated: Below average quality; crowded and
      confusing format
      Clear: Average or above average quality; clear format

   b) Illustrations and Print (A subjective evaluation of
      the clarity of print and illustrations)
      Unclear: Below average quality; difficult to follow
      and confusing
      Clear: Average or above average quality; readable,
      realistic, up-to-date, and bold

   c) Auditory Presentation (A subjective evaluation of
      the clarity and ease of understanding of oral presentations)
      Garbled: Below average quality; garbled presentation
      using slang and/or poorly paced
      Clear: Average or above average quality; easily
      understood presentation with no slang and well
      paced

4. TIMING

   Timing and Pacing (A subjective evaluation of the
   timing guidelines)
   Inappropriate: Suggested timing and pacing techniques
   are usually inappropriate for the designated
   examinees' educational level or amount of time available
   for testing.
   Appropriate: Suggested timing and pacing techniques
   are usually appropriate for examinees.
   No guidelines: No timing and pacing guidelines are
   given.

5. RECORDING ANSWERS

   Response Scheme
   Complicated: The procedure used to record answers
   to the test items is difficult to use and likely to be
   confusing to examinees.
   Simple: The procedures used to record answers are
   simple and easy to use (multiple choice or fill-ins).



3. ADMINISTRATIVE USABILITY

1. ADMINISTRATION

a) Size of Testing Group

0 (1) Individual only: Test(s) must be administered
on an individual basis.

2 (2) Groups only: Test(s) may be administered to
small groups (fewer than 30).

2 (2) Individual or group: Test(s) may be administered
to large groups (more than 30). Group administration
includes a single cassette that can be used
by many students simultaneously.




b) Administrator

0 (0) Specialist: Only a specialist (such as a
psychologist) may administer the CRT.

1 (1) School personnel: Teachers or classroom aides
may administer the CRT.

1 (2) Self-administration: The CRT can be administered
without assistance from teachers or others.
This includes CRT systems with audio-cassette test
directions.

c) Administration Time

Shortest possible time: Shortest time recommended
for examinees to complete any single test.

Longest possible time: Longest time recommended
for examinees to complete any single test.

No time limits: No limits on testing time are
provided.

d) Directions

Not available: There are no directions to be read to
or by the examinee.

Available, incomplete: Directions do not cover all
aspects of test taking; a standardized system for test
taking is not guaranteed.

Available, complete: Directions cover all aspects of
test taking, and a standardized system is established.

e) Equipment

Special equipment needed: Special materials other
than paper and pencils needed for test administration
(such as cassettes and video equipment)

No special equipment needed: Only paper and pencils
are needed for test administration.

f) Order for Testing (The order in which the objectives
measured by the CRT must be tested)

Prescribed sequence: Objectives must be tested in a
prescribed order (this is frequently the case, for
example, with CRT systems designed for use with a
specific curriculum or with CRTs that test all
objectives on a single form).

General guidelines: An order for testing objectives
is recommended but not mandatory (this is frequently
the case when objectives are structured in a
hierarchy).

None: There is no prescribed order in which
objectives must be tested.

2. ADAPTABILITY (The extent to which the CRT system can
be modified by the user)

Modification of CRT by User (Guidelines provided by
the publisher for altering or modifying various
aspects of the CRT)

No: No modifications are permitted, or no guidelines
are given.

Yes, limited: Guidelines are given for making limited
changes to the CRT.

Yes, extensive: Guidelines are given for making
extensive changes to the CRT.



3: Administrative Usability

1. ADMINISTRATION
   Size of Testing Group:   0(1) □ Individual only   2(1) □ Groups only   2(2) □ Individual or group
   Administrator:           0(0) □ Specialist   1(1) □ School personnel   1(2) □ Self-administration
   Administration Time:     □ Shortest possible time:   □ Longest possible time:   □ No time limits
   Directions:              □ Not available   □ Available, incomplete   □ Available, complete
   Equipment:               □ Special equipment needed   □ No special equipment needed
   Order for Testing:       □ Prescribed sequence   □ General guidelines   □ None

2. ADAPTABILITY
   Modification of CRT by User:   □ No   □ Yes, limited   □ Yes, extensive

3. SCORING
   Ease of Scoring:   □ Subjective   □ Objective

4. INTERPRETATION
   Score Interpreter:            □ Specialist   □ School personnel   □ Self-interpreting
   Score-Interpretation Guide:   □ Not available   □ Available, complicated   □ Available, not complicated
   Score Interpretability:       □ Complicated, unusual   □ Simple, standard

5. DECISIONS
   Decision-Making Utility:   □ No guidelines   □ Yes, limited   □ Yes, extensive
   Curriculum Referencing:    0(0) □ None   0(1) □ Yes, some   0(2) □ Yes, extensive



3. SCORING

Ease of Scoring

Subjective: Test scores are not assigned using a
standardized set of rules and procedures and can be
considered a function of scorer's discretion.

Objective: Objective scoring system that is
standardized

4. INTERPRETATION

a) Score Interpreter

Specialist: A specialist is required to interpret the
CRT's scores.

School personnel: Teachers or classroom aides can
interpret CRT's scores.

Self-interpreting: Test forms include a scoring
mechanism (carbon-backed test form with scoring
key and interpretation guide).

b) Score-Interpretation Guide (Interpretation guides,
in various forms, which permit correct and consistent
interpretation of a CRT's scores)

Not available: No guides or directions for
interpreting scores are provided.

Available, complicated: Guides are provided to
assist in the interpretation of scores but are difficult
to understand or have inadequate instructions.

Available, not complicated: Easy-to-use
interpretation guidelines are provided.

c) Score Interpretability

Complicated, unusual: Interpretation systems are
not commonly used and/or require numerous tables
and mathematical conversions.

Simple, standard: Interpretation systems are
generally understood and easily used.

5. DECISIONS

a) Decision-Making Utility (The usefulness of
guidelines provided by the publisher that describe how to
use the CRT results to make educational decisions)

No guidelines: Guidelines or rules for decision
making are not provided.

Yes, limited: Guidelines are provided, but they are
vague, incomplete, or not particularly relevant to the
test's stated purposes.

Yes, extensive: Guidelines are provided that are
complete, clearly defined, and relevant to the test's
stated purposes.

b) Curriculum Referencing (A guide, usually organized
by objective, linking each of the CRT's objectives to
specific components of major instructional programs,
is provided.)

0 (0) None: The publisher provides no system of
curriculum referencing.

0 (1) Yes, some: The publisher provides a referencing
system that is limited in that not all the CRT's
objectives are referenced and/or only a small number of
instructional programs are included.

0 (2) Yes, extensive: The publisher provides a
referencing system that includes all the CRT's objectives
and most of the major instructional programs.



4. FUNCTION AND PURPOSE

1. PURPOSE (The purpose of the CRT suggested by the
publisher)

Test Uses

0 (1) Diagnosis: The CRT is used to identify
difficulties with specific learning objectives, tasks, and/or
behaviors.

0 (1) Placement: The CRT is used to locate the
examinee's position in a curriculum or learning hierarchy.

1 (2) Achievement/progress: The CRT is used to
measure achievement of specific learning objectives,
tasks, and/or behaviors.

2 (1) Achievement/outcomes: The CRT is used to
measure the outcomes of instruction and/or the
extent to which an educational program's objectives
have been achieved.

2 (0) Comparison of instructional programs: The CRT
is used to compare two or more educational
programs.

Other, please specify: Name other uses suggested
for CRT.



5. OBJECTIVES DEVELOPMENT

1. DOMAIN (OBJECTIVE) SPECIFICATION

a) Domain Definition (How the domain or the organized
set of objectives measured by the CRT is defined)

None: Domain is not reported and/or defined.

Vague, general: Domain is defined in unclear or in
very general terms.

Specific: Domain is defined clearly and in detail.

b) Domain Structure

Content area: Domain structure is not clearly
specified (for example, the domain is only linked to a broad
content area).

Content/process matrix: Domain is structured by a
content/process matrix that defines the knowledge
that will be assessed and the ways in which it will
be measured.

Objectives/item generation format: Formal,
replicable rules are given for generation of items and/or
test items.

2. OBJECTIVES' CHARACTERISTICS

a) Organization of Objectives

None: A complete list of objectives that define the
CRT for the grade/level being reviewed is not
provided.






4: Function and Purpose

1. PURPOSE (Check all that apply)
   0(1) □ Diagnosis   0(1) □ Placement   1(2) □ Achievement/progress
   2(1) □ Achievement/outcomes   2(0) □ Comparison of instructional programs
   □ Other, please specify:



5: Objectives Development

1. DOMAIN (OBJECTIVE) SPECIFICATION
   Domain Definition:   □ None   □ Vague, general   □ Specific
   Domain Structure:    □ Content area   □ Content/process matrix   □ Objectives/item generation format

2. OBJECTIVES' CHARACTERISTICS
   Organization of Objectives:          □ None   □ Simple list (no structure)   □ Categories (strands)   □ Hierarchy
   Level of Generality of Objectives:   □ General   □ Specific   □ Very detailed

3. MATCH TO INSTRUCTION
   Curriculum Match:   2(0) □ None   1(1) □ Some   0(2) □ Extensive








Simple list (no structure): A list of objectives that
define the CRT is given, but the objectives are not
structured or organized in any specific fashion.

Categories (strands): A list of the objectives
defining the CRT is provided and the objectives are
organized into major skill areas or strands (in reading,
two strands that might be used to organize the
objectives are comprehension and vocabulary).

Hierarchy: A list of the objectives that define the CRT
is provided with the objectives organized within
categories into a hierarchy of skills/tasks.

b) Level of Generality of Objectives (How broadly or
narrowly objectives are stated)

General: Very global statements cover a wide range
of content, skills, and behavior.

Specific: Statements clearly define the skill or
knowledge being assessed but are not so specific as to
constitute behavioral objectives.

Very detailed: Objectives are stated in detail or in
behavioral terms.

3. MATCH TO INSTRUCTION

Curriculum Match

2 (0) None: The CRT system is not designed for use
with a specific instructional program.

1 (1) Some: The CRT system is not necessarily
dependent on the skills or context of an instructional
program. However, it may be more appropriately used
with certain types of programs (for example, a CRT
may be developed from several instructional
programs and reflect the bias of these programs, or the
CRT might emphasize terminology and nomenclature
used in only some programs).

0 (2) Extensive: The CRT, its objectives, and test
items are dependent on a particular curriculum or set
of instructional materials and techniques.



6. ITEM DEVELOPMENT

1. ITEM-OBJECTIVE RELATION

a) Items Coded to Objectives

0 (0) No: Items are not referenced to a specific
objective(s).

2 (2) Yes: Each test item is referenced to a specific
objective(s).

b) Number of Items per Objective (The minimum,
maximum, and average number of items used to test each
objective)

c) Scope of Coverage of Objectives

Poor: In general, test items do not adequately cover
the range of behaviors, contents, situations, and/or
skills that are associated with the objectives being
tested.

Good: Most test items adequately cover the range of
skills, behaviors, contents, and/or situations
associated with the objective being tested.




2. ITEM GENERATION

Rules for Item Writing (A procedure or set of rules
for writing test items)

None: No system/rules were used (or reported) to
guide item writing.

Suggestions: Some very general rules were provided
(and reported) to guide item writing (all items must
be multiple choice, for example).

Specifications: Comprehensive, detailed system/
rules were provided (and reported) to guide item
writing. Such rules should limit the kinds of items
used to measure an objective (define appropriate
content and format, for example).



7. METHODS OF SCORE INTERPRETATION

1. CRITERION-REFERENCED

Scores Reported in Terms of Level of Performance
(Criterion-referenced test scores for individuals and
groups must be presented in terms of the level of
competency or mastery of the specific objectives on
which the CRT is based. The distinctive feature of a
CRT score must, therefore, lie in its emphasis on
describing the absolute rather than the relative level
of performance with respect to an objective or skill.)
Some of the different kinds of CRT scores include:

0 (0) Item difficulty: This represents the percentage
of examinees or groups who "pass" each item; that
is, the item's difficulty.

0 (0) Actual score: This is the number or percent of
correct items on a given objective, referring to the
number of items actually passed on the test.

1 (1) True score: This indicates an individual's or
group's true level of performance on an objective,
referring to the portion of the total universe of items
for an objective that an individual or group could
answer correctly. (That is, if every possible item was
tested, this score is the number of items that an
individual or group would pass.)

0 (0) Arbitrary mastery: This refers to whether an
individual or group has achieved a pre-set but
arbitrarily defined level of performance.

1 (1) Empirical mastery: This refers to whether an
individual or group has achieved a pre-set criterion
level of performance where the criterion level is
educationally meaningful and empirically justified.

1 (1) Achievement level after a certain amount of
learning time: This reports the time it takes (class
hours or calendar days) for an examinee or group to
achieve a given performance level.

1 (1) Probability of achieving next level: This refers
to the probability that the examinee is ready to begin
the next level of instruction (this may be based on
both the number of items correct and the patterns of
answers given to these items).
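
As an illustration of three of the score types defined above (item difficulty, actual score, and arbitrary mastery), the following Python sketch computes each from a small matrix of pass/fail responses. The data, the objective names, and the 80 percent cutoff are hypothetical and are not taken from any reviewed CRT; the sketch only restates the definitions in executable form.

```python
from typing import Dict, List

# Hypothetical data: rows are examinees, columns are items (1 = pass, 0 = fail).
responses: List[List[int]] = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 1, 0, 1],
]

# Hypothetical mapping from an objective to the item columns that test it.
objective_items: Dict[str, List[int]] = {"obj_A": [0, 1], "obj_B": [2, 3]}


def item_difficulty(data: List[List[int]]) -> List[float]:
    """Percentage of examinees who pass each item."""
    n_examinees = len(data)
    n_items = len(data[0])
    return [100.0 * sum(row[j] for row in data) / n_examinees
            for j in range(n_items)]


def actual_score(row: List[int], items: List[int]) -> float:
    """Percent of items passed by one examinee on one objective."""
    return 100.0 * sum(row[j] for j in items) / len(items)


def arbitrary_mastery(row: List[int], items: List[int],
                      cutoff: float = 80.0) -> bool:
    """True if the actual score reaches a pre-set cutoff (assumed: 80 percent)."""
    return actual_score(row, items) >= cutoff


print(item_difficulty(responses))                          # [75.0, 75.0, 25.0, 100.0]
print(actual_score(responses[1], objective_items["obj_B"]))  # 50.0
print(arbitrary_mastery(responses[2], objective_items["obj_B"]))  # True
```

A true score, by contrast, cannot be computed from the sampled items alone, since it refers to the whole universe of items for an objective; it would have to be estimated.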










6: Item Development

1. ITEM-OBJECTIVE RELATION
   Items Coded to Objectives:         0(0) □ No   2(2) □ Yes
   Number of Items per Objective:     □ Not applicable (items not coded to objectives)
                                      □ Range:    to    items/objective
                                      □ Average:
   Scope of Coverage of Objectives:   □ Poor   □ Good

2. ITEM GENERATION
   Rules for Item Writing:   □ None   □ Suggestions   □ Specifications



7: Methods of Score Interpretation

1. CRITERION-REFERENCED
   Scores Reported in Terms of Level of Performance (Check all that apply)
   0(0) □ Item difficulty
   0(0) □ Actual score (Percent of items correct)
   1(1) □ True score (Percent of objective achieved)
   0(0) □ Arbitrary mastery
   1(1) □ Empirical mastery
   1(1) □ Achievement level after a certain amount of learning time
   1(1) □ Probability of achieving next level

2. NORM-REFERENCED
   Comparative Scores Reported as well as a Criterion-Referenced Score (Check all that apply)
   1(1) □ Standard score scales
   1(1) □ Age/grade equivalents
   1(1) □ Percentiles



2. NORM-REFERENCED

Comparative Scores Reported as Well as a Criterion-
Referenced Score (Individual's or groups' scores
must be interpreted in relation to
individuals or groups who have taken or who might
take the test.)

1 (1) Standard score scales and age/grade
equivalents: These describe an individual's or group's
expected performance at given grade levels.

1 (1) Percentiles: These describe scores in terms of
the ranking or percentage of individuals who
fall below a given score.
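
The percentile definition above can be made concrete with a short Python sketch. It assumes the strict "fall below" reading of the definition (ties are not counted), and the norm-group scores are invented for illustration.

```python
from typing import Sequence


def percentile_rank(score: float, reference_scores: Sequence[float]) -> float:
    """Percentage of reference-group scores that fall strictly below `score`."""
    if not reference_scores:
        raise ValueError("reference_scores must not be empty")
    below = sum(1 for s in reference_scores if s < score)
    return 100.0 * below / len(reference_scores)


# Hypothetical norm group of ten raw scores.
norm_group = [12, 15, 15, 18, 20, 22, 25, 25, 27, 30]

print(percentile_rank(22, norm_group))  # 50.0
```

Published norm tables often use other conventions (for example, counting half of the tied scores), so this is only one reading of the definition.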



8. ANALYSIS AND VALIDATION

1. FIELD TEST (A field test of the published version of the
CRT system in which examinee response data are used to
establish the reliability and validity of the items and
test system. Field tests should not be confused with
pilot tests of unpublished, preliminary working drafts
of the CRT system.)

a) Field Test Reported

No: Field test was not conducted, or field test was
not documented.

Yes: Field-test methods and results are reported
with some detail.

b) Scale

Small: Field test involved just one or two schools or
a single school district.

Moderate: Field test involved several school
districts.

Large: Field test involved students from many
school districts.

c) Scope-Geographic

Local: Field test was restricted to one city or county.

Regional: Field test involved a specific region of the
country.

National: Field test sites are geographically
representative of the nation.

d) Scope-Ethnic

Little ethnic representation: Minority groups are
not included in sufficient numbers in field test
(cannot measure with confidence the relative performance
of minority groups).

Ethnic representation: Minority groups are included
in sufficient numbers in field test.

e) Sample Representativeness

No sampling plan: Participants in field test were
selected without any predetermined sampling plan
(schools volunteered for field testing).

Probability sampling plan: Participants in the field
test were selected using random sampling or random
stratified sampling.

Special sampling (non-probability): Participants in
field test were selected using a systematic but
non-probabilistic plan.
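
To show what a probability sampling plan of the kind described above might look like, here is a minimal Python sketch of simple random sampling and stratified random sampling of field-test sites. The school names, the urban/rural strata, and the function names are hypothetical; the report does not prescribe a particular procedure.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple


def simple_random_sample(units: List[str], n: int, seed: int = 0) -> List[str]:
    """Draw n units without replacement, each unit equally likely."""
    rng = random.Random(seed)
    return rng.sample(units, n)


def stratified_random_sample(units: List[Tuple[str, str]], per_stratum: int,
                             seed: int = 0) -> Dict[str, List[str]]:
    """Draw the same number of units at random from every stratum.

    `units` is a list of (name, stratum) pairs.
    """
    rng = random.Random(seed)
    strata: Dict[str, List[str]] = defaultdict(list)
    for name, stratum in units:
        strata[stratum].append(name)
    return {stratum: rng.sample(sorted(members), per_stratum)
            for stratum, members in sorted(strata.items())}


# Hypothetical sampling frame of schools labeled by stratum.
schools = [
    ("School A", "urban"), ("School B", "urban"), ("School C", "urban"),
    ("School D", "rural"), ("School E", "rural"), ("School F", "rural"),
]

print(simple_random_sample([name for name, _ in schools], 3))
print(stratified_random_sample(schools, 2))
```

By contrast, a plan in which schools volunteer for field testing would fall under "No sampling plan", because no random selection step is involved.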



2. ITEM QUALITY/JUDGMENTAL DATA*

a) Item-Objective Congruence (The extent to which an
item measures the relevant objective)

Not reported: No judgmental data on item-objective
congruence are reported.

Reported (Give value): Some judgmental data on
item-objective congruence are reported.

b) Cultural Bias (The existence of systematic
differences in performance on an item across different
cultural groups)

Not reported: No judgmental data on cultural bias
are reported.

Reported (Give value): Some judgmental data on
cultural bias are reported.

3. ITEM QUALITY/RESPONSE DATA

a) Sensitivity to Instruction (An item's ability to
discriminate between those who have and have not
benefited from instruction)

Not reported: No estimate of sensitivity to
instruction based on response data is reported.

Reported (Give value): Some estimate of sensitivity
to instruction is reported.

b) Equivalence (The internal consistency of items
associated with a particular objective; the extent to
which an item behaves similarly to other items
measuring the same objective)

Not reported: No estimate of equivalence based on
response data is reported.

Reported (Give value): Some estimate of equivalence
is reported.

c) Stability (The extent to which performance on an
item remains constant over time)

Not reported: No estimate of stability based on
response data is reported.

Reported (Give value): Some estimate of stability
is reported.

d) Cultural Bias

Not reported: No estimate of cultural bias based on
response data is reported.

Reported (Give value): Some estimate of cultural
bias is reported.
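
The criteria above ask only whether an estimate is reported, not how it is computed. As one possible illustration of an item-level estimate of sensitivity to instruction, the Python sketch below takes the difference between the proportion of examinees passing an item after instruction and the proportion passing it before. This pre/post difference is an assumed choice of index, and the response data are invented.

```python
from typing import List


def proportion_passing(item_responses: List[int]) -> float:
    """Proportion of examinees who pass the item (1 = pass, 0 = fail)."""
    return sum(item_responses) / len(item_responses)


def pre_post_difference(pre: List[int], post: List[int]) -> float:
    """Proportion passing after instruction minus proportion passing before.

    Values near +1 suggest the item separates instructed from
    uninstructed examinees; values near 0 suggest it does not.
    """
    return proportion_passing(post) - proportion_passing(pre)


# Hypothetical responses of five examinees to one item.
pre_instruction = [0, 0, 1, 0, 0]
post_instruction = [1, 1, 1, 0, 1]

print(round(pre_post_difference(pre_instruction, post_instruction), 2))  # 0.6
```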

4. TEST QUALITY/JUDGMENTAL DATA

a) Test-Objective Congruence (The extent to which a
test measures the relevant objective; content validity)

0 (0) Not reported: No judgmental data on
test-objective congruence are reported.

1 (1) Reported (Give value): Some judgmental data
on test-objective congruence are reported.

b) Cultural Bias (The existence of systematic
differences in performance on the test across cultural
groups)

*Different methods can be used to evaluate the quality of items and
the quality of the total test. A distinction is made between
the kinds of data (judgmental and response) used to determine
test quality. Judgmental data refer to reviews of the test materials by
experts and other persons who might use the system. Response data refer
to the use of participants' scores from field tests of the CRT materials.



8: Analysis and Validation

1. FIELD TEST
   Field Test Reported:         □ No   □ Yes
   Scale:                       □ Small   □ Moderate   □ Large
   Scope-Geographic:            □ Local   □ Regional   □ National
   Scope-Ethnic:                □ Little ethnic representation   □ Ethnic representation
   Sample Representativeness:   □ No sampling plan   □ Probability sampling plan   □ Special sampling (non-probability)

2. ITEM QUALITY (judgmental data)
   Item-Objective Congruence:   □ Not reported   □ Reported (Give value:     )
   Cultural Bias:               □ Not reported   □ Reported (Give value:     )

3. ITEM QUALITY (response data)
   Sensitivity to Instruction:                          □ Not reported   □ Reported (Give value:     )
   Equivalence (item-objective internal consistency):   □ Not reported   □ Reported (Give value:     )
   Stability:                                           □ Not reported   □ Reported (Give value:     )
   Cultural Bias:                                       □ Not reported   □ Reported (Give value:     )

4. TEST QUALITY (judgmental data)
   Test-Objective Congruence (content validity):   0(0) □ Not reported   1(1) □ Reported (Give value:     )
   Cultural Bias:                                  0(0) □ Not reported   1(1) □ Reported (Give value:     )

5. TEST QUALITY (response data)
   Sensitivity to Instruction:                        0(0) □ Not reported   1(2) □ Reported (Give value:     )
   Equivalence (internal consistency by objective):   0(0) □ Not reported   2(2) □ Reported (Give value:     )
   Stability (test-retest; alternate forms):          0(0) □ Not reported   2(2) □ Reported (Give value:     )
   Stability (number of items per objective):         0(0) □ Not reported   2(2) □ Reported (Give value:     )
   Criterion Validity:                                0(0) □ Not reported   2(2) □ Reported (Give value:     )
   Cultural Bias:                                     0(0) □ Not reported   2(2) □ Reported (Give value:     )








0 (0) Not reported: No judgmental data on cultural
bias are reported.

1 (1) Reported (Give value): Some judgmental data
on cultural bias are reported.

5. TEST QUALITY/RESPONSE DATA

a) Sensitivity to Instruction (A test's ability to
discriminate between those who have and those who have not
benefited from instruction)

0 (0) Not reported: No estimate of sensitivity to
instruction based on response data is reported.

1 (2) Reported (Give value): Some estimate of
sensitivity to instruction is reported.

b) Equivalence (Internal consistency; the extent to
which all items that measure a given objective behave
similarly)

0 (0) Not reported: No estimate of equivalence
based on response data is reported.

2 (2) Reported (Give value): Some estimate of
equivalence is reported.

c) Stability (Test-retest; alternate forms; the extent to
which test performance remains constant over time)

0 (0) Not reported: No estimate of stability over
time based on response data is reported.

2 (2) Reported (Give value): Some estimate of
stability over time is reported.

d) Stability (Number of items per objective; a
determination of the number of items needed to obtain a
stable score on an objective)

0 (0) Not reported: No estimate of the number of
items needed for a stable score is reported.

2 (2) Reported (Give value): Some estimate of the
number of items needed for a stable score is reported.

e) Criterion Validity (A determination of the criterion in
terms of which CRT scores are reported)

0 (0) Not reported: No estimate of criterion validity
based on response data is reported.

2 (2) Reported (Give value): Some estimate of
criterion validity is reported.

f) Cultural Bias (The existence of systematic differences
in test performance across cultural groups; this can
be measured by regression techniques)

0 (0) Not reported: No estimate of cultural bias
based on response data is reported.

2 (2) Reported (Give value): Some estimate of
cultural bias is reported.
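
As a sketch of the kind of value that might be reported for test-retest stability, the Python example below computes a Pearson correlation between scores from two administrations of the same test. The choice of the Pearson coefficient is an assumption made for illustration (the criteria above do not name a statistic), and the scores are invented.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("scores must vary in both administrations")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical objective scores for five examinees, tested twice.
first_administration = [4, 7, 6, 9, 5]
second_administration = [5, 7, 6, 8, 4]

stability = pearson_r(first_administration, second_administration)
print(round(stability, 2))
```

When nearly all examinees reach mastery, scores vary little and a correlation can understate stability, so a reviewer might prefer an agreement-based index in that case.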






REFERENCES*

1. Anderson, S., et al. Encyclopedia of educational
   evaluation. San Francisco: Jossey-Bass, 1975.

2. Baker, E.L. Using measurement to improve instruction.
   Paper presented at Convention of American
   Psychological Association, Honolulu, Hawaii, 1972.
   ED 069 762.

3. Baker, R.L. Measurement considerations in instruction
   product development. Paper presented at Conference
   on Problems in Objectives-Based Measurement,
   Center for the Study of Evaluation, Univer. of
   California, 1972.

4. Cronbach, L.J. Essentials of psychological testing.
   (3rd ed.) New York: Harper, 1970.

5. Cronbach, L.J. Test validation. In R.L. Thorndike (Ed.),
   Educational Measurement (2nd ed.). Washington,
   D.C.: American Council on Education, 1971.

6. Ebel, R.L. Evaluation and educational objectives:
   behavioral and otherwise. Paper presented at the
   Convention of the American Psychological Association,
   Honolulu, Hawaii, 1972.

7. Glaser, R., & Nitko, A. Measurement in learning and
   instruction. In R.L. Thorndike (Ed.), Educational
   Measurement (2nd ed.). Washington, D.C.: American
   Council on Education, 1971. Pp. 652-670.

8. Harris, C. Comments on problems of objectives-based
   measurement. Paper presented at the annual meeting,
   American Educational Research Association,
   New Orleans, 1973.

9. Harris, M.L., and Stewart, D.M. Application of
   classical strategies to criterion-referenced test
   construction. Paper presented at the annual meeting,
   American Educational Research Association, New York,
   1971.

10. Hively, W. Introduction to domain-referenced
    achievement testing. Symposium presentation, American
    Educational Research Association, Minnesota, 1970.

11. Hively, W., Maxwell, G., Rabehl, G., Sension, D., &
    Lundin, S. Domain-referenced curriculum evaluation:
    a technical handbook and a case study from the
    MINNEMAST project. CSE Monograph Series in
    Evaluation, Volume 1. Center for the Study of
    Evaluation, Univer. of California, Los Angeles, 1973.

12. Hoepfner, R., Conniff, W. Jr., Petrosko, J., Watkins,
    J., Erlich, O., Todaro, R., and Hoyt, M. CSE
    secondary school test evaluations: grades 7 and 8.
    Center for the Study of Evaluation, Univer. of
    California, Los Angeles, 1974. ED 113 382.

13. Keesling, James W. Identification of differing intended
    outcomes and their implications for evaluation.
    Paper presented at the annual meeting, American
    Educational Research Association, Washington,
    D.C., 1976.

14. Klein, S.P. Evaluating tests in terms of the
    information they provide. Evaluation Comment, 1970, 2,
    No. 2, 1-6. ED 045 699.

15. Klein, S.P. An evaluation of New Mexico's educational
    priorities. Paper presented at Western Psychological
    Association, Portland, Oregon, 1972. ED 077 936.

16. Mager, R.F. Preparing instructional objectives. San
    Francisco: Fearon Publishers, Inc., 1962.

17. Novick, M.R. and Lewis, C. Prescribing test length
    for criterion-referenced measurement. CSE Monograph
    No. 3. Center for the Study of Evaluation,
    Univer. of California, Los Angeles, 1974.

18. Popham, W.J., & Husek, T.R. Implications of
    criterion-referenced measurement. Journal of Educational
    Measurement, 1969, 6, No. 1, 1-9.

19. Popham, W.J. Educational evaluation. Englewood
    Cliffs, N.J.: Prentice-Hall, 1975.

20. Skager, R. Generating criterion-referenced tests from
    objectives based assessment systems: unsolved
    problems in test development, assembly and
    interpretation. Paper presented at the annual meeting,
    American Educational Research Association, New
    Orleans, 1973.

21. Skager, R. Critical differentiating characteristics for
    tests of educational achievement. Paper presented
    at the annual meeting, American Educational
    Research Association, Washington, D.C., 1975.

22. Wilson, H.A. A humanistic approach to criterion-
    referenced testing. Paper presented at the annual
    meeting, American Educational Research Association,
    New Orleans, 1973. ED 081 842.

*Items followed by an ED number (for example, ED 099 291) are available
from the ERIC Document Reproduction Service (EDRS). Consult the
most recent issue of Resources in Education for the address and ordering
information.






